{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assignment_07: First step of using machine learning and models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据分析"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jieba\n",
    "import re\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import time, timeit\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "fname = r'D:\\GitHub\\Data\\Xinhuashe\\sqlResult_1558435.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "content = pd.read_csv(fname, encoding='gb18030')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>author</th>\n",
       "      <th>source</th>\n",
       "      <th>content</th>\n",
       "      <th>feature</th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>89617</td>\n",
       "      <td>NaN</td>\n",
       "      <td>快科技@http://www.kkj.cn/</td>\n",
       "      <td>此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...</td>\n",
       "      <td>{\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"37\"...</td>\n",
       "      <td>小米MIUI 9首批机型曝光：共计15款</td>\n",
       "      <td>http://www.cnbeta.com/articles/tech/623597.htm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>89616</td>\n",
       "      <td>NaN</td>\n",
       "      <td>快科技@http://www.kkj.cn/</td>\n",
       "      <td>骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...</td>\n",
       "      <td>{\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"15\"...</td>\n",
       "      <td>骁龙835在Windows 10上的性能表现有望改善</td>\n",
       "      <td>http://www.cnbeta.com/articles/tech/623599.htm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>89615</td>\n",
       "      <td>NaN</td>\n",
       "      <td>快科技@http://www.kkj.cn/</td>\n",
       "      <td>此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\\r\\n...</td>\n",
       "      <td>{\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"18\"...</td>\n",
       "      <td>一加手机5细节曝光：3300mAh、充半小时用1天</td>\n",
       "      <td>http://www.cnbeta.com/articles/tech/623601.htm</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id author                  source  \\\n",
       "0  89617    NaN  快科技@http://www.kkj.cn/   \n",
       "1  89616    NaN  快科技@http://www.kkj.cn/   \n",
       "2  89615    NaN  快科技@http://www.kkj.cn/   \n",
       "\n",
       "                                             content  \\\n",
       "0  此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...   \n",
       "1  骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...   \n",
       "2  此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\\r\\n...   \n",
       "\n",
       "                                             feature  \\\n",
       "0  {\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"37\"...   \n",
       "1  {\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"15\"...   \n",
       "2  {\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"18\"...   \n",
       "\n",
       "                        title                                             url  \n",
       "0        小米MIUI 9首批机型曝光：共计15款  http://www.cnbeta.com/articles/tech/623597.htm  \n",
       "1  骁龙835在Windows 10上的性能表现有望改善  http://www.cnbeta.com/articles/tech/623599.htm  \n",
       "2   一加手机5细节曝光：3300mAh、充半小时用1天  http://www.cnbeta.com/articles/tech/623601.htm  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "content.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据预处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "89611"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cal time\n",
    "def clock(func):\n",
    "    def clocked(*args, **kwargs):\n",
    "        t0 = timeit.default_timer()\n",
    "        result = func(*args, **kwargs)\n",
    "        elapsed = timeit.default_timer() - t0\n",
    "        #name = func.__name__\n",
    "        #print(name,' elapsed: ',elapsed,' s')\n",
    "        print(elapsed, 's')\n",
    "        return result\n",
    "    return clocked"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.0344137999999958 s\n",
      "content (38729226, 432.1927665130397)\n",
      "1.0425247000000013 s\n",
      "feature (53450994, 596.4780439901351)\n",
      "1.0204907999999975 s\n",
      "title (2123231, 23.69386570845097)\n",
      "1.0224979999999988 s\n",
      "url (7472901, 83.39267500641662)\n"
     ]
    }
   ],
   "source": [
    "# look at each arrtibutes\n",
    "@clock\n",
    "def get_size(data, attr):\n",
    "    size = 0\n",
    "    data = data[attr]\n",
    "    for i in range(len(data)):\n",
    "        if isinstance(data[i], str):\n",
    "            size += len(data[i])\n",
    "    return size, size / len(data)\n",
    "attributes = ['content','feature', 'title', 'url']\n",
    "show = [print(i, get_size(df, i)) for i in attributes]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(content, columns=['content', 'source'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "87052"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compare(x):\n",
    "    if x == '新华社':\n",
    "        return 1\n",
    "    return 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['source'] = df['source'].apply(compare)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>source</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\\r\\n...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\\r\\n@深圳交警微博称：昨日清...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>受到A股被纳入MSCI指数的利好消息刺激，A股市场从周三开始再度上演龙马行情，周四上午金...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>虽然至今夏普智能手机在市场上无法排得上号，已经完全没落，并于 2013 年退出中国市场，但是...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>沙漠雄鹰：震荡有利消化套牢筹码\\r\\n　　周四开盘上证50在银行券商大蓝筹带动下一度涨近...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>（原标题：武汉警方一下子抓了808人，还都是俊男靓女！原来他们每天偷偷摸摸干这事！）\\r\\n...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>6月21日，A股纳入MSCI指数尘埃落定，但当天被寄予厚望的券商股并未扛起反弹大旗。22...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             content  source\n",
       "0  此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...       0\n",
       "1  骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...       0\n",
       "2  此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\\r\\n...       0\n",
       "3    这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n       1\n",
       "4  （原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\\r\\n@深圳交警微博称：昨日清...       0\n",
       "5  　　受到A股被纳入MSCI指数的利好消息刺激，A股市场从周三开始再度上演龙马行情，周四上午金...       0\n",
       "6  虽然至今夏普智能手机在市场上无法排得上号，已经完全没落，并于 2013 年退出中国市场，但是...       0\n",
       "7  　　沙漠雄鹰：震荡有利消化套牢筹码\\r\\n　　周四开盘上证50在银行券商大蓝筹带动下一度涨近...       0\n",
       "8  （原标题：武汉警方一下子抓了808人，还都是俊男靓女！原来他们每天偷偷摸摸干这事！）\\r\\n...       0\n",
       "9  　　6月21日，A股纳入MSCI指数尘埃落定，但当天被寄予厚望的券商股并未扛起反弹大旗。22...       0"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "87052\n",
      "78661\n",
      "0.903609336948031\n"
     ]
    }
   ],
   "source": [
    "print(len(df))\n",
    "print(len(df[df['source'] == 1]))\n",
    "print(len(df[df['source'] == 1]) / len(df))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_pos = df[df['source'] == 1]\n",
    "data_neg = df[df['source'] == 0]\n",
    "\n",
    "data_sampled = data_pos.append(data_neg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(data_sampled)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>source</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>新华社韩国济州6月18日电综述：亚投行第二届年会三大亮点\\r\\n新华社记者 耿学鹏 严蕾\\r...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>113</th>\n",
       "      <td>新华社北京6月18日电 经军委领导批准，《军营理论热点怎么看·2017》日前印发全军。\\r\\...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>305</th>\n",
       "      <td>新华社兰州6月19日电（记者张钦）记者19日了解到，刚刚出台的《甘肃省网络扶贫行动的实施方案...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>312</th>\n",
       "      <td>新华社乌鲁木齐6月18日电（记者杜刚、王秉阳）6月18日，新疆天山天池首次举办山地马拉松比赛...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>314</th>\n",
       "      <td>新华社北京6月19日电（记者熊丰）记者从公安部消防局获悉，19日上午6时，由公安部消防局组织...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>543</th>\n",
       "      <td>暑期旅游旺季是各大旅游景区开门迎客的好时机。然而，有5家5A级景区却高兴不起来：国家旅游...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>545</th>\n",
       "      <td>新华社北京6月20日电　总部设在德国波恩的国际残奥委会在当地时间19日宣布，中国残联主席...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>651</th>\n",
       "      <td>新华社南京4月17日电(记者潘晔)江苏拥有不少5A景区。日前，一份来自清华大学的第三方调...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               content  source\n",
       "3      这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n       1\n",
       "39     这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n       1\n",
       "95   新华社韩国济州6月18日电综述：亚投行第二届年会三大亮点\\r\\n新华社记者 耿学鹏 严蕾\\r...       1\n",
       "113  新华社北京6月18日电 经军委领导批准，《军营理论热点怎么看·2017》日前印发全军。\\r\\...       1\n",
       "305  新华社兰州6月19日电（记者张钦）记者19日了解到，刚刚出台的《甘肃省网络扶贫行动的实施方案...       1\n",
       "312  新华社乌鲁木齐6月18日电（记者杜刚、王秉阳）6月18日，新疆天山天池首次举办山地马拉松比赛...       1\n",
       "314  新华社北京6月19日电（记者熊丰）记者从公安部消防局获悉，19日上午6时，由公安部消防局组织...       1\n",
       "543  　　暑期旅游旺季是各大旅游景区开门迎客的好时机。然而，有5家5A级景区却高兴不起来：国家旅游...       1\n",
       "545  　　新华社北京6月20日电　总部设在德国波恩的国际残奥委会在当地时间19日宣布，中国残联主席...       1\n",
       "651  　　新华社南京4月17日电(记者潘晔)江苏拥有不少5A景区。日前，一份来自清华大学的第三方调...       1"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_sampled[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cut(string):\n",
    "    return ' '.join(jieba.cut(re.sub(r'[^\\w\\s]', ' ', string.strip())))\n",
    "\n",
    "def pre_process(data):\n",
    "    x_input = data.loc[:,['content']]\n",
    "    y_input = data['source'].values\n",
    "    return x_input, y_input\n",
    "x_inputs, y_inputs = pre_process(data_sampled)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>95</th>\n",
       "      <td>新华社韩国济州6月18日电综述：亚投行第二届年会三大亮点\\r\\n新华社记者 耿学鹏 严蕾\\r...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>113</th>\n",
       "      <td>新华社北京6月18日电 经军委领导批准，《军营理论热点怎么看·2017》日前印发全军。\\r\\...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>305</th>\n",
       "      <td>新华社兰州6月19日电（记者张钦）记者19日了解到，刚刚出台的《甘肃省网络扶贫行动的实施方案...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>312</th>\n",
       "      <td>新华社乌鲁木齐6月18日电（记者杜刚、王秉阳）6月18日，新疆天山天池首次举办山地马拉松比赛...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>314</th>\n",
       "      <td>新华社北京6月19日电（记者熊丰）记者从公安部消防局获悉，19日上午6时，由公安部消防局组织...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>543</th>\n",
       "      <td>暑期旅游旺季是各大旅游景区开门迎客的好时机。然而，有5家5A级景区却高兴不起来：国家旅游...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>545</th>\n",
       "      <td>新华社北京6月20日电　总部设在德国波恩的国际残奥委会在当地时间19日宣布，中国残联主席...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>651</th>\n",
       "      <td>新华社南京4月17日电(记者潘晔)江苏拥有不少5A景区。日前，一份来自清华大学的第三方调...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               content\n",
       "3      这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n\n",
       "39     这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n\n",
       "95   新华社韩国济州6月18日电综述：亚投行第二届年会三大亮点\\r\\n新华社记者 耿学鹏 严蕾\\r...\n",
       "113  新华社北京6月18日电 经军委领导批准，《军营理论热点怎么看·2017》日前印发全军。\\r\\...\n",
       "305  新华社兰州6月19日电（记者张钦）记者19日了解到，刚刚出台的《甘肃省网络扶贫行动的实施方案...\n",
       "312  新华社乌鲁木齐6月18日电（记者杜刚、王秉阳）6月18日，新疆天山天池首次举办山地马拉松比赛...\n",
       "314  新华社北京6月19日电（记者熊丰）记者从公安部消防局获悉，19日上午6时，由公安部消防局组织...\n",
       "543  　　暑期旅游旺季是各大旅游景区开门迎客的好时机。然而，有5家5A级景区却高兴不起来：国家旅游...\n",
       "545  　　新华社北京6月20日电　总部设在德国波恩的国际残奥委会在当地时间19日宣布，中国残联主席...\n",
       "651  　　新华社南京4月17日电(记者潘晔)江苏拥有不少5A景区。日前，一份来自清华大学的第三方调..."
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x_inputs[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, ..., 0, 0, 0], dtype=int64)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_inputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用tfidf进行文本向量化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(87052, 500)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# (?u)进入re.u，unicode匹配模式，https://stackoverflow.com/questions/35043085/what-does-u-do-in-a-regex\n",
    "vectorizer = TfidfVectorizer(max_features=500, \n",
    "                             token_pattern=r\"(?u)\\b\\w+\\b\", \n",
    "                             max_df = 1.0,\n",
    "                             stop_words = None,\n",
    "                             vocabulary = None )\n",
    "vec = None\n",
    "vec_fea = vectorizer.fit_transform(x_inputs['content'].values)\n",
    "np_vec = vec_fea.toarray()\n",
    "\n",
    "if vec is None:\n",
    "    vec = np_vec\n",
    "else:\n",
    "    vec = np.hstack((vec,np_vec))\n",
    "vec.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 参考scikit-learning的方法，构建你的第一个机器学习模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "https://blog.csdn.net/hfutdog/article/details/88085878  \n",
    "F1 score越高，说明模型越稳健  \n",
    "精确率指模型预测为正的样本中实际也为正的样本占被预测为正的样本的比例  \n",
    "准确率是分类正确的样本占总样本个数的比例  \n",
    "召回率指实际为正的样本中被预测为正的样本所占实际为正的样本的比例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import f1_score\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.metrics import precision_score\n",
    "from sklearn.metrics import recall_score\n",
    "from sklearn.metrics import roc_auc_score\n",
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "@clock\n",
    "def get_performance(clf, x_, y_):\n",
    "    y_hat = clf.predict(x_)\n",
    "    print('f1_score is: {}'.format(f1_score(y_, y_hat)))\n",
    "    print('accuracy is: {}'.format(accuracy_score(y_, y_hat)))\n",
    "    print('percision is: {}'.format(precision_score(y_, y_hat)))\n",
    "    print('recall is: {}'.format(recall_score(y_, y_hat)))\n",
    "    print('roc_auc is: {}'.format(roc_auc_score(y_, y_hat)))\n",
    "    print('confusion matrix: \\n{}'.format(confusion_matrix(y_, y_hat, labels=[0, 1])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 在traning_data, validation_data, test_data 上观察其相关metric: recall, precision, f1等， 并解释其含义. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "@clock\n",
    "def KNN(x_train, y_train):\n",
    "    clf = KNeighborsClassifier(n_neighbors = 5, weights = 'uniform', algorithm = 'auto')\n",
    "    clf.fit(x_train, y_train)\n",
    "    return clf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train,x_test, y_train, y_test = train_test_split(\n",
    "    vec , y_inputs, train_size = 0.8, test_size=0.2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "15.65492230000018 s\n",
      "f1_score is: 0.9914047384947191\n",
      "accuracy is: 0.9844351272184252\n",
      "percision is: 0.9901171998732974\n",
      "recall is: 0.9926956300813008\n",
      "roc_auc is: 0.9495571731690247\n",
      "confusion matrix: \n",
      "[[ 1511   156]\n",
      " [  115 15629]]\n",
      "346.68970140000056 s\n"
     ]
    }
   ],
   "source": [
    "get_performance(KNN(x_train, y_train), x_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train,x_test, y_train, y_test = train_test_split(\n",
    "    vec , y_inputs, train_size = 0.85, test_size=0.15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "17.63165089999984 s\n",
      "f1_score is: 0.9908749786798567\n",
      "accuracy is: 0.9836115791085924\n",
      "percision is: 0.9898619867098313\n",
      "recall is: 0.9918900460986854\n",
      "roc_auc is: 0.9516741897160094\n",
      "confusion matrix: \n",
      "[[ 1225   119]\n",
      " [   95 11619]]\n",
      "267.3778229999998 s\n"
     ]
    }
   ],
   "source": [
    "get_performance(KNN(x_train, y_train), x_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "@clock\n",
    "def KNN(x_train, y_train):\n",
    "    clf = KNeighborsClassifier(n_neighbors = 5, weights = 'distance', algorithm = 'auto')\n",
    "    clf.fit(x_train, y_train)\n",
    "    return clf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train,x_test, y_train, y_test = train_test_split(\n",
    "    vec , y_inputs, train_size = 0.85, test_size=0.15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "18.541930000000008 s\n",
      "f1_score is: 0.991046042860174\n",
      "accuracy is: 0.9838413233266963\n",
      "percision is: 0.9925201869953251\n",
      "recall is: 0.9895762711864406\n",
      "roc_auc is: 0.9598119829700088\n",
      "confusion matrix: \n",
      "[[ 1170    88]\n",
      " [  123 11677]]\n",
      "262.3214868 s\n"
     ]
    }
   ],
   "source": [
    "get_performance(KNN(x_train, y_train), x_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 找出所以预测为 1， 但是实际为 0 的文章。 作为抄袭的候选者。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "17.527140599999996 s\n"
     ]
    }
   ],
   "source": [
    "knc = KNN(x_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = knc.predict(np_vec)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((87052,), (87052,))"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_pred.shape, y_inputs.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_sampled['content_predict'] = y_pred"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>source</th>\n",
       "      <th>content_predict</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10968</th>\n",
       "      <td>我军新设立“八一勋章”\\r\\n</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10969</th>\n",
       "      <td>??成都商报讯（记者 张瑾）成都2017年高中阶段教育学校统一招生考试13、14日（简阳市为...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10970</th>\n",
       "      <td>本报讯（记者 聂春洁）记者昨日从铁路部门获悉，全国铁路图7月1日将迎来大调整，届时备受关注的...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10971</th>\n",
       "      <td>传奇收藏版包括：73厘米高的Bayek与Senu（宠物鹰）雕像、限量编号证书（编号从1至99...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10972</th>\n",
       "      <td>中新社 刘震 摄\\r\\n“我在这里向你们郑重承诺：中国政府将为你们的参与创造尽可能优惠的条件...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10973</th>\n",
       "      <td>郑报融媒记者 张立 石闯 顾翔 文/图\\r\\n【脱贫攻坚·记者蹲点】\\r\\n夏日里，乱石坡房...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10974</th>\n",
       "      <td>????新华社电?经中央军委主席习近平批准，我军新设立“八一勋章”，并组织开展首次评选。新设...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10975</th>\n",
       "      <td>??创新 务实\\r\\n??理性 开朗\\r\\n??今日市区 中雨转大雨 21℃~28.6℃ 南...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10977</th>\n",
       "      <td>意见明确了加强和完善城乡社区治理要坚持的五项基本原则，即坚持党的领导，固本强基；坚持以人...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10978</th>\n",
       "      <td>据新华社北京6月12日电 经中央军委主席习近平批准，我军新设立“八一勋章”，并组织开展首次评...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 content  source  \\\n",
       "10968                                    我军新设立“八一勋章”\\r\\n       0   \n",
       "10969  ??成都商报讯（记者 张瑾）成都2017年高中阶段教育学校统一招生考试13、14日（简阳市为...       0   \n",
       "10970  本报讯（记者 聂春洁）记者昨日从铁路部门获悉，全国铁路图7月1日将迎来大调整，届时备受关注的...       0   \n",
       "10971  传奇收藏版包括：73厘米高的Bayek与Senu（宠物鹰）雕像、限量编号证书（编号从1至99...       0   \n",
       "10972  中新社 刘震 摄\\r\\n“我在这里向你们郑重承诺：中国政府将为你们的参与创造尽可能优惠的条件...       0   \n",
       "10973  郑报融媒记者 张立 石闯 顾翔 文/图\\r\\n【脱贫攻坚·记者蹲点】\\r\\n夏日里，乱石坡房...       0   \n",
       "10974  ????新华社电?经中央军委主席习近平批准，我军新设立“八一勋章”，并组织开展首次评选。新设...       0   \n",
       "10975  ??创新 务实\\r\\n??理性 开朗\\r\\n??今日市区 中雨转大雨 21℃~28.6℃ 南...       0   \n",
       "10977  　　意见明确了加强和完善城乡社区治理要坚持的五项基本原则，即坚持党的领导，固本强基；坚持以人...       0   \n",
       "10978  据新华社北京6月12日电 经中央军委主席习近平批准，我军新设立“八一勋章”，并组织开展首次评...       0   \n",
       "\n",
       "       content_predict  \n",
       "10968                0  \n",
       "10969                0  \n",
       "10970                0  \n",
       "10971                0  \n",
       "10972                0  \n",
       "10973                0  \n",
       "10974                0  \n",
       "10975                0  \n",
       "10977                0  \n",
       "10978                0  "
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_sampled[-10:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 实际为0， 预测为1\n",
    "copy_content = data_sampled[(data_sampled.source == 0) & (data_sampled.content_predict == 1)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>content</th>\n",
       "      <th>source</th>\n",
       "      <th>content_predict</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>127</th>\n",
       "      <td>经济日报·中国经济网记者 廉 丹\\r\\n香港回归祖国20周年前夕，香港特区行政长官梁振英在官...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>192</th>\n",
       "      <td>打架现场\\r\\n凤凰体育讯北京时间今晚19点35分， 中超第13轮，万众瞩目的“京津德比”将...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>290</th>\n",
       "      <td>中证网讯 （记者 周佳）经过昨日的冲高回落后，23日早盘，沪深两市受下方均线系统的支撑，...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>380</th>\n",
       "      <td>央广网阿克苏6月23日消息（记者蒋雪娇 阿克苏台记者张鸿海 盛月 陈婉仪）近年来，阿克苏市针...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>424</th>\n",
       "      <td>2017年6月15日，以“奥林匹克，瞬间的永恒”为主题的“2017北京奥林匹克博览会”在...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                               content  source  \\\n",
       "127  经济日报·中国经济网记者 廉 丹\\r\\n香港回归祖国20周年前夕，香港特区行政长官梁振英在官...       0   \n",
       "192  打架现场\\r\\n凤凰体育讯北京时间今晚19点35分， 中超第13轮，万众瞩目的“京津德比”将...       0   \n",
       "290  　　中证网讯 （记者 周佳）经过昨日的冲高回落后，23日早盘，沪深两市受下方均线系统的支撑，...       0   \n",
       "380  央广网阿克苏6月23日消息（记者蒋雪娇 阿克苏台记者张鸿海 盛月 陈婉仪）近年来，阿克苏市针...       0   \n",
       "424  　　2017年6月15日，以“奥林匹克，瞬间的永恒”为主题的“2017北京奥林匹克博览会”在...       0   \n",
       "\n",
       "     content_predict  \n",
       "127                1  \n",
       "192                1  \n",
       "290                1  \n",
       "380                1  \n",
       "424                1  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "copy_content.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "105"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(copy_content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 总结该过程，什么是数据思维？什么是机器学习思维？\n",
    "\n",
    "- 数据思维：利用数据之间的关联性，去分析事物之间的关联性。。。。。。\n",
    "- 机器学习思维：利用得到的数据集，制定某种规则，利用计算机来计算，得出所需要的输出。用得到的输出再反过来对数据集进行操作。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional)使用第4课讲解的 edit distance，在涉嫌抄袭的文章中，找到其重复的文字与被修改过的文字。 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step11: 利用第8课讲述的新模型，进行操作，感受其中不同的参数、模型对性能的影响。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
